1.
Cell Syst ; 15(3): 286-294.e2, 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-38428432

ABSTRACT

Pretrained protein sequence language models have been shown to improve the performance of many prediction tasks and are now routinely integrated into bioinformatics tools. However, these models largely rely on the transformer architecture, which scales quadratically with sequence length in both run-time and memory. Therefore, state-of-the-art models have limitations on sequence length. To address this limitation, we investigated whether convolutional neural network (CNN) architectures, which scale linearly with sequence length, could be as effective as transformers in protein language models. With masked language model pretraining, CNNs are competitive with, and occasionally superior to, transformers across downstream applications while maintaining strong performance on sequences longer than those allowed in the current state-of-the-art transformer models. Our work suggests that computational efficiency can be improved without sacrificing performance, simply by using a CNN architecture instead of a transformer, and emphasizes the importance of disentangling pretraining task and model architecture. A record of this paper's transparent peer review process is included in the supplemental information.


Subjects
Computational Biology; Neural Networks, Computer; Amino Acid Sequence; Peer Review
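Note on entry 1: the scaling contrast drawn in the abstract can be made concrete with a rough cost count. The sketch below is only an illustration under simplifying assumptions; the function names and the kernel size of 5 are hypothetical and do not come from the paper. It compares the number of pairwise scores one self-attention layer forms (one per pair of positions) with the multiply-adds of a single-channel 1D convolution (one per position per kernel tap).

```python
def attention_pair_scores(seq_len: int) -> int:
    """Pairwise scores in one self-attention layer: one per (query, key) pair."""
    return seq_len * seq_len


def conv_multiply_adds(seq_len: int, kernel_size: int = 5) -> int:
    """Multiply-adds for a single-channel 1D convolution with 'same' padding.

    kernel_size=5 is an arbitrary illustrative value.
    """
    return seq_len * kernel_size


if __name__ == "__main__":
    # Doubling the sequence length quadruples the attention count
    # but only doubles the convolution count.
    for length in (1_000, 2_000, 4_000):
        print(length, attention_pair_scores(length), conv_multiply_adds(length))
```

Real models add channels, heads, and layers, so these counts only show the growth rate, not actual run-time or memory.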
2.
bioRxiv ; 2024 Feb 12.
Article in English | MEDLINE | ID: mdl-38405697

ABSTRACT

Clustering is commonly used in single-cell RNA-sequencing (scRNA-seq) pipelines to characterize cellular heterogeneity. However, current methods face two main limitations. First, they require user-specified heuristics which add time and complexity to bioinformatic workflows; second, they rely on post-selective differential expression analyses to identify marker genes driving cluster differences, which has been shown to be subject to inflated false discovery rates. We address these challenges by introducing nonparametric clustering of single-cell populations (NCLUSION): an infinite mixture model that leverages Bayesian sparse priors to identify marker genes while simultaneously performing clustering on single-cell expression data. NCLUSION uses a scalable variational inference algorithm to perform these analyses on datasets with up to millions of cells. By analyzing publicly available scRNA-seq studies, we demonstrate that NCLUSION (i) matches the performance of other state-of-the-art clustering techniques with significantly reduced runtime and (ii) provides statistically robust and biologically relevant transcriptomic signatures for each of the clusters it identifies. Overall, NCLUSION represents a reliable hypothesis-generating tool for understanding patterns of expression variation present in single-cell populations.

3.
Nat Biomed Eng ; 2(1): 38-47, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29998038

ABSTRACT

The CRISPR-Cas9 system provides unprecedented genome editing capabilities. However, off-target effects lead to sub-optimal usage and are also a bottleneck in the development of therapeutic uses. Herein, we introduce the first machine learning-based approach to off-target prediction, yielding a state-of-the-art model for CRISPR-Cas9 that outperforms all other guide design services. Our approach, Elevation, consists of two interdependent machine learning models: one for scoring individual guide-target pairs, and another that aggregates these guide-target scores into a single, overall summary guide score. Through systematic investigation, we demonstrate that Elevation performs substantially better than competing approaches on both tasks. Additionally, we are the first to systematically evaluate approaches on the guide summary score problem; we show that the most widely used method performs no better than random at times, whereas Elevation consistently outperforms it, sometimes by an order of magnitude. We also introduce an evaluation method that balances errors between active and inactive guides, thereby encapsulating a range of practical use cases; Elevation is consistently superior to other methods across the entire range. Finally, because of the large scale and computational demands of off-target prediction, we have developed a cloud-based service for quick retrieval. This service provides end-to-end guide design by also incorporating our previously reported on-target model, Azimuth (https://crispr.ml; please treat this web site as confidential until publication).

4.
Nat Biotechnol ; 36(2): 179-189, 2018 Feb.
Article in English | MEDLINE | ID: mdl-29251726

ABSTRACT

Combinatorial genetic screening using CRISPR-Cas9 is a useful approach to uncover redundant genes and to explore complex gene networks. However, current methods suffer from interference between the single-guide RNAs (sgRNAs) and from limited gene targeting activity. To increase the efficiency of combinatorial screening, we employ orthogonal Cas9 enzymes from Staphylococcus aureus and Streptococcus pyogenes. We used machine learning to establish S. aureus Cas9 sgRNA design rules and paired S. aureus Cas9 with S. pyogenes Cas9 to achieve dual targeting in a high fraction of cells. We also developed a lentiviral vector and cloning strategy to generate high-complexity pooled dual-knockout libraries to identify synthetic lethal and buffering gene pairs across multiple cell types, including MAPK pathway genes and apoptotic genes. Our orthologous approach also enabled a screen combining gene knockouts with transcriptional activation, which revealed genetic interactions with TP53. The "Big Papi" (paired aureus and pyogenes for interactions) approach described here will be widely applicable for the study of combinatorial phenotypes.


Subjects
CRISPR-Cas Systems/genetics; Epistasis, Genetic/genetics; Genetic Testing; RNA, Guide, Kinetoplastida/genetics; Apoptosis/genetics; Gene Knockout Techniques; Gene Targeting; Humans; Machine Learning; Mitogen-Activated Protein Kinase Kinases/genetics; Signal Transduction/genetics; Staphylococcus aureus/genetics; Streptococcus pyogenes/genetics; Tumor Suppressor Protein p53/genetics
5.
J Comput Biol ; 24(6): 524-535, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28056190

ABSTRACT

Genome-wide association studies commonly examine one trait at a time. Occasionally they examine several related traits with the hope of increasing power; in such a setting, the traits are not generally smoothly varying in any way such as time or space. However, for function-valued traits, the trait is often smoothly varying along the axis of interest, such as space or time. For instance, in the case of longitudinal traits such as growth curves, the axis of interest is time; for spatially varying traits such as chromatin accessibility, it would be position along the genome. Although there have been efforts to perform genome-wide association studies with such function-valued traits, the statistical approaches developed for this purpose often have limitations such as requiring the trait to behave linearly in time or space, or constraining the genetic effect itself to be constant or linear in time. Herein, we present a flexible model for this problem, the Partitioned Gaussian Process, which removes many such limitations and is especially effective as the number of time points increases. The theoretical basis of this model provides machinery for handling missing and unaligned function values such as would occur when not all individuals are measured at the same time points. Furthermore, we make use of algebraic refactorizations to substantially reduce the time complexity of our model beyond the naive implementation. Finally, we apply our approach and several others to synthetic data before closing with some directions for improved modeling and statistical testing.


Subjects
Genome-Wide Association Study/methods; Models, Genetic; Models, Statistical; Quantitative Trait, Heritable; Sequence Analysis, DNA/methods; Computer Simulation; Humans; Normal Distribution; Statistics, Nonparametric
6.
Nat Med ; 22(6): 606-13, 2016 Jun.
Article in English | MEDLINE | ID: mdl-27183217

ABSTRACT

Human leukocyte antigen class I (HLA)-restricted CD8+ T lymphocyte (CTL) responses are crucial to HIV-1 control. Although HIV can evade these responses, the longer-term impact of viral escape mutants remains unclear, as these variants can also reduce intrinsic viral fitness. To address this, we here developed a metric to determine the degree of HIV adaptation to an HLA profile. We demonstrate that transmission of viruses that are pre-adapted to the HLA molecules expressed in the recipient is associated with impaired immunogenicity, elevated viral load and accelerated CD4+ T cell decline. Furthermore, the extent of pre-adaptation among circulating viruses explains much of the variation in outcomes attributed to the expression of certain HLA alleles. Thus, viral pre-adaptation exploits 'holes' in the immune response. Accounting for these holes may be key for vaccine strategies seeking to elicit functional responses from viral variants, and to HIV cure strategies that require broad CTL responses to achieve successful eradication of HIV reservoirs.


Subjects
Adaptation, Physiological/immunology; CD8-Positive T-Lymphocytes/immunology; HIV Infections/transmission; HIV-1/immunology; Histocompatibility Antigens Class I/immunology; Immune Evasion/immunology; AIDS Vaccines/immunology; Africa, Southern; British Columbia; CD4 Lymphocyte Count; Cohort Studies; Evolution, Molecular; HIV Infections/immunology; HIV-1/genetics; Humans; Immune Evasion/genetics; Immunity, Cellular/immunology; Linear Models; Models, Immunological; Proportional Hazards Models; Receptors, Antigen, T-Cell/immunology; Viral Load; Virus Replication/genetics
7.
Nat Biotechnol ; 34(2): 184-191, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26780180

ABSTRACT

CRISPR-Cas9-based genetic screens are a powerful new tool in biology. By simply altering the sequence of the single-guide RNA (sgRNA), one can reprogram Cas9 to target different sites in the genome with relative ease, but the on-target activity and off-target effects of individual sgRNAs can vary widely. Here, we use recently devised sgRNA design rules to create human and mouse genome-wide libraries, perform positive and negative selection screens and observe that the use of these rules produced improved results. Additionally, we profile the off-target activity of thousands of sgRNAs and develop a metric to predict off-target sites. We incorporate these findings from large-scale, empirical data to improve our computational design rules and create optimized sgRNA libraries that maximize on-target activity and minimize off-target effects to enable more effective and efficient genetic screens and genome engineering.


Subjects
CRISPR-Cas Systems/genetics; Genetic Engineering/methods; Genomics/methods; RNA, Guide, Kinetoplastida/genetics; Animals; Cell Line, Tumor; Drug Resistance/genetics; Gene Library; Genome/genetics; Humans; Mice
8.
Sci Rep ; 4: 6874, 2014 Nov 12.
Article in English | MEDLINE | ID: mdl-25387525

ABSTRACT

We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that predict the phenotype well, in combination with principal components as covariates, controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that predict the phenotype well, again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons are available at http://microsoft.com/science.


Subjects
Genome-Wide Association Study/statistics & numerical data; Linear Models; Polymorphism, Single Nucleotide; Software; Algorithms; Animals; Genotype; Humans; Mice; Models, Genetic; Phenotype
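Note on entry 8: the abstract describes a GSM estimated from SNPs and a mixture of two component GSMs. The sketch below shows one common way such matrices are built, namely the scaled cross-product of standardized genotypes; whether the paper uses exactly this estimator is an assumption. The function names, the toy genotypes, the mixing weight, and the choice of "selected" SNPs (simply the first two columns here) are all hypothetical.

```python
from statistics import fmean, pstdev


def standardize_columns(snps):
    """Center and scale each SNP column; drop SNPs that do not vary."""
    columns = []
    for col in zip(*snps):
        mu, sd = fmean(col), pstdev(col)
        if sd > 0:
            columns.append([(g - mu) / sd for g in col])
    return [list(row) for row in zip(*columns)]


def genetic_similarity(snps):
    """GSM as Z Z^T / M, where Z holds standardized genotypes for M SNPs."""
    z = standardize_columns(snps)
    m = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / m for zj in z] for zi in z]


def mix_gsms(k_a, k_b, weight):
    """Convex combination of two GSMs; weight is the share given to k_b."""
    return [[(1 - weight) * a + weight * b for a, b in zip(ra, rb)]
            for ra, rb in zip(k_a, k_b)]


# Toy data: 4 individuals x 4 SNPs, coded as minor-allele counts (0, 1, 2).
snps = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
    [0, 2, 2, 1],
]
k_all = genetic_similarity(snps)
k_subset = genetic_similarity([row[:2] for row in snps])  # hypothetical selection
k_mix = mix_gsms(k_all, k_subset, 0.5)
```

In the paper the second GSM is built from SNPs chosen because they predict the phenotype; the column slice above only stands in for that selection step.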
9.
Nat Commun ; 5: 4890, 2014 Sep 19.
Article in English | MEDLINE | ID: mdl-25234577

ABSTRACT

Linear mixed models (LMMs) are a powerful and established tool for studying genotype-phenotype relationships. A limitation of the LMM is that the model assumes Gaussian-distributed residuals, a requirement that rarely holds in practice. Violations of this assumption can lead to false conclusions and loss of power. To mitigate this problem, it is common practice to pre-process the phenotypic values to make them as Gaussian as possible, for instance by applying logarithmic or other nonlinear transformations. Unfortunately, different phenotypes require different transformations, and choosing an appropriate transformation is challenging and subjective. Here we present an extension of the LMM that estimates an optimal transformation from the observed data. In simulations and applications to real data from human, mouse and yeast, we show that using transformations inferred by our model increases power in genome-wide association studies and increases the accuracy of heritability estimation and phenotype prediction.


Subjects
Linear Models; Models, Genetic; Animals; Computer Simulation; Databases, Factual; Fungi/genetics; Fungi/metabolism; Genetic Association Studies; Genome-Wide Association Study; Humans; Mice; Normal Distribution; Phenotype; Polymorphism, Single Nucleotide; Yeasts
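Note on entry 9: the abstract mentions the common practice of applying a logarithmic transformation to make phenotype values more Gaussian. The sketch below illustrates only that manual pre-processing step, not the paper's learned transformation. The toy values and the use of sample skewness as a rough symmetry check are assumptions made for illustration.

```python
import math
from statistics import fmean


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    m3 = fmean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Hypothetical right-skewed phenotype values (exponentials of a symmetric grid).
raw = [math.exp(x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
logged = [math.log(v) for v in raw]

if __name__ == "__main__":
    print("skewness before log:", skewness(raw))
    print("skewness after log: ", skewness(logged))
```

A log transform happens to be the right choice for this toy data; the point made in the abstract is that the appropriate transformation differs between phenotypes, which motivates estimating it from the data.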
10.
Elife ; 2: e01123, 2013 Oct 29.
Article in English | MEDLINE | ID: mdl-24171102

ABSTRACT

HIV-1 sequence diversity is affected by selection pressures arising from host genomic factors. Using paired human and viral data from 1071 individuals, we ran >3000 genome-wide scans, testing for associations between host DNA polymorphisms, HIV-1 sequence variation and plasma viral load (VL), while considering human and viral population structure. We observed significant human SNP associations to a total of 48 HIV-1 amino acid variants (p < 2.4 × 10⁻¹²). All associated SNPs mapped to the HLA class I region. Clinical relevance of host and pathogen variation was assessed using VL results. We identified two critical advantages to the use of viral variation for identifying host factors: (1) association signals are much stronger for HIV-1 sequence variants than VL, reflecting the 'intermediate phenotype' nature of viral variation; (2) association testing can be run without any clinical data. The proposed genome-to-genome approach highlights sites of genomic conflict and is a strategy generally applicable to studies of host-pathogen interaction. DOI: http://dx.doi.org/10.7554/eLife.01123.001.


Subjects
Genome, Human; Genome, Viral; HIV Infections/genetics; HIV-1/genetics; Polymorphism, Single Nucleotide; Viral Load/genetics; Alleles; Genome-Wide Association Study; HIV Infections/immunology; HIV Infections/virology; HIV-1/immunology; Histocompatibility Antigens Class I/genetics; Histocompatibility Antigens Class I/immunology; Host-Pathogen Interactions/genetics; Host-Pathogen Interactions/immunology; Humans; Viral Load/immunology
11.
Brain ; 136(Pt 11): 3305-32, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24065725

ABSTRACT

Amyotrophic lateral sclerosis is heterogeneous with high variability in the speed of progression even in cases with a defined genetic cause such as superoxide dismutase 1 (SOD1) mutations. We reported that SOD1(G93A) mice on distinct genetic backgrounds (C57 and 129Sv) show consistent phenotypic differences in speed of disease progression and life-span that are not explained by differences in human SOD1 transgene copy number or the burden of mutant SOD1 protein within the nervous system. We aimed to compare the gene expression profiles of motor neurons from these two SOD1(G93A) mouse strains to discover the molecular mechanisms contributing to the distinct phenotypes and to identify factors underlying fast and slow disease progression. Lumbar spinal motor neurons from the two SOD1(G93A) mouse strains were isolated by laser capture microdissection and transcriptome analysis was conducted at four stages of disease. We identified marked differences in the motor neuron transcriptome between the two mouse strains at disease onset, with a dramatic reduction of gene expression in the rapidly progressive (129Sv-SOD1(G93A)) compared with the slowly progressing mutant SOD1 mice (C57-SOD1(G93A)) (1276 versus 346; Q-value ≤ 0.01). Gene ontology pathway analysis of the transcriptional profile from 129Sv-SOD1(G93A) mice showed marked downregulation of specific pathways involved in mitochondrial function, as well as predicted deficiencies in protein degradation and axonal transport mechanisms. In contrast, the transcriptional profile from C57-SOD1(G93A) mice, with the more benign disease course, revealed strong gene enrichment relating to immune system processes compared with 129Sv-SOD1(G93A) mice. Motor neurons from the more benign mutant strain demonstrated striking complement activation, over-expressing genes normally involved in immune cell function. We validated through immunohistochemistry increased expression of the C3 complement subunit and major histocompatibility complex I within motor neurons. In addition, we demonstrated that motor neurons from the slowly progressing mice activate a series of genes with neuroprotective properties such as angiogenin and the nuclear factor (erythroid-derived 2)-like 2 transcriptional regulator. In contrast, the faster progressing mice show dramatically reduced expression at disease onset of cell pathways involved in neuroprotection. This study highlights a set of key gene and molecular pathway indices of fast or slow disease progression which may prove useful in identifying potential disease modifiers responsible for the heterogeneity of human amyotrophic lateral sclerosis and which may represent valid therapeutic targets for ameliorating the disease course in humans.


Subjects
Amyotrophic Lateral Sclerosis/genetics; Disease Progression; Motor Neurons/pathology; Superoxide Dismutase/genetics; Transcriptome/genetics; Amyotrophic Lateral Sclerosis/pathology; Animals; Disease Models, Animal; Female; Mice; Mice, 129 Strain; Mice, Inbred C57BL; Mice, Transgenic; Motor Neurons/metabolism; Mutation/genetics; Phenotype; Superoxide Dismutase-1; Time Factors
12.
Bioinformatics ; 29(11): 1382-9, 2013 Jun 01.
Article in English | MEDLINE | ID: mdl-23559640

ABSTRACT

MOTIVATION: Genomic studies have revealed a substantial heritable component of the transcriptional state of the cell. To fully understand the genetic regulation of gene expression variability, it is important to study the effect of genotype in the context of external factors such as alternative environmental conditions. In model systems, explicit environmental perturbations have been considered for this purpose, allowing direct tests for environment-specific genetic effects. However, such experiments are limited to species that can be profiled in controlled environments, hampering their use in important systems such as humans. Moreover, even in seemingly tightly regulated experimental conditions, subtle environmental perturbations cannot be ruled out, and hence unknown environmental influences are frequent. Here, we propose a model-based approach to simultaneously infer unmeasured environmental factors from gene expression profiles and use them in genetic analyses, identifying environment-specific associations between polymorphic loci and individual gene expression traits. RESULTS: In extensive simulation studies, we show that our method is able to accurately reconstruct environmental factors and their interactions with genotype in a variety of settings. We further illustrate the use of our model in a real-world dataset in which one environmental factor has been explicitly experimentally controlled. Our method is able to accurately reconstruct the true underlying environmental factor even if it is not given as an input, allowing genuine genotype-environment interactions to be detected. In addition to the known environmental factor, we find unmeasured factors involved in novel genotype-environment interactions. Our results suggest that interactions with both known and unknown environmental factors significantly contribute to gene expression variability. AVAILABILITY AND IMPLEMENTATION: Software available at http://pmbio.github.io/envGPLVM/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Gene Expression Profiling; Gene Expression Regulation; Gene-Environment Interaction; Gene Expression Regulation, Fungal; Genotype; Humans; Linear Models; Models, Genetic; Quantitative Trait Loci
13.
Acta Neuropathol ; 125(1): 95-109, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23143228

ABSTRACT

A consistent clinical feature of amyotrophic lateral sclerosis (ALS) is the sparing of eye movements and the function of external sphincters, with corresponding preservation of motor neurons in the brainstem oculomotor nuclei, and of Onuf's nucleus in the sacral spinal cord. Studying the differences in properties of neurons that are vulnerable and resistant to the disease process in ALS may provide insights into the mechanisms of neuronal degeneration, and identify targets for therapeutic manipulation. We used microarray analysis to determine the differences in gene expression between oculomotor and spinal motor neurons, isolated by laser capture microdissection from the midbrain and spinal cord of neurologically normal human controls. We compared these to transcriptional profiles of oculomotor nuclei and spinal cord from rat and mouse, obtained from the GEO omnibus database. We show that oculomotor neurons have a distinct transcriptional profile, with significant differential expression of 1,757 named genes (q < 0.001). Differentially expressed genes are enriched for the functional categories of synaptic transmission, ubiquitin-dependent proteolysis, mitochondrial function, transcriptional regulation, immune system functions, and the extracellular matrix. Marked differences are seen, across the three species, in genes with a function in synaptic transmission, including several glutamate and GABA receptor subunits. Using patch clamp recording in acute spinal and brainstem slices, we show that resistant oculomotor neurons show a reduced AMPA-mediated inward calcium current, and a higher GABA-mediated chloride current, than vulnerable spinal motor neurons. The findings suggest that reduced susceptibility to excitotoxicity, mediated in part through enhanced GABAergic transmission, is an important determinant of the relative resistance of oculomotor neurons to degeneration in ALS.


Subjects
Amyotrophic Lateral Sclerosis/genetics; Gene Expression Regulation/genetics; Spinal Cord/metabolism; Synaptic Transmission/genetics; Aged; Amyotrophic Lateral Sclerosis/metabolism; Amyotrophic Lateral Sclerosis/physiopathology; Female; Genetic Predisposition to Disease; Humans; Male; Middle Aged; Motor Neurons/metabolism; Motor Neurons/pathology; Nerve Degeneration/genetics; Nerve Degeneration/prevention & control; Receptors, AMPA/genetics; Receptors, AMPA/metabolism; Spinal Cord/pathology; gamma-Aminobutyric Acid/genetics; gamma-Aminobutyric Acid/metabolism
14.
PLoS Comput Biol ; 8(1): e1002330, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22241974

ABSTRACT

Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies is the presence of hidden confounding factors, such as unobserved covariates or unknown subtle environmental perturbations. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious associations or mask real genetic association signals. Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, this new model can more accurately distinguish true genetic association signals from confounding variation. We applied our model and compared it to existing methods on different datasets and biological systems. PANAMA consistently performs better than alternative methods, and finds in particular substantially more trans regulators. Importantly, our approach not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies. A software implementation of PANAMA is freely available online at http://ml.sheffield.ac.uk/qtl/.


Subjects
Algorithms; Chromosome Mapping/methods; Gene Expression Regulation/genetics; Genetic Variation/genetics; Models, Genetic; Models, Statistical; Quantitative Trait Loci/genetics; Animals; Computer Simulation; Confounding Factors, Epidemiologic; Data Interpretation, Statistical; Humans; Sensitivity and Specificity